Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) #1128
val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM
Conversation
…result on PR openai#549 stack

First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. SLOT optimizes a delta vector at the last hidden layer inside the TTT scoring loop.

SLOT results (3-seed): seed 1337: 1.1188 BPB | seed 42: 1.1185 BPB | seed 2025: 1.1183 BPB
mean: 1.1185 (std 0.0003) vs baseline 1.1193 — consistent -0.0008 improvement

Also documents CTW as a negative result across 3 implementation iterations:
- v1 (naive n-gram lookup): +0.005 worse, 46 min eval
- v2 (proper recursive weighting + entropy gating): not runnable in time budget
- v3 (vectorized entropy gate): still worse, killed early

Root cause: signal redundancy — the transformer already captures all n-gram patterns.

Base: PR openai#549 by @abaybektursun (LeakyReLU² + Legal TTT + Parallel Muon)
…4 (3-seed mean)

First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. Optimizes a 512-dim delta vector at the last hidden layer per batch during TTT scoring. AdamW, lr=0.003, 5 steps. Splits forward_logits() into forward_hidden() + compute_logits().

3-seed results (8×H100 SXM): seed 1337: 1.1153 BPB | seed 42: 1.1156 BPB | seed 2025: 1.1153 BPB
mean: 1.1154 (std 0.0002) | val_loss mean: 1.8833
vs SOTA PR openai#549: -0.0083 nats (>0.005 required) ✅

Base: PR openai#549 by @abaybektursun
SLOT paper: Hu et al., arXiv:2505.12392v2
Hi @AnubhavBharadwaaj -- a constructive observation about SLOT legality that might be worth considering. After reviewing the organizer's enforcement pattern on Issue #677, I noticed that SLOT may fall under the same "adapt on validation before the reported eval pass" pattern that led to 33+ PR closures (valerio-oai, 2026-03-27).
This differs from the legal score-first TTT in PR #549, where chunk N is scored before the model adapts on it. No organizer has ruled on SLOT specifically, so this may be fine -- but I wanted to flag it so the community can discuss it before multiple PRs build on this technique. An organizer clarification on Issue #677 or #1017 would help everyone. (We had a SLOT-based submission at 1.1015 that we self-closed for this reason: PR #1172.)
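The legality distinction above hinges on evaluation order. A minimal sketch of the score-first pattern (toy `score`/`adapt` callbacks are hypothetical stand-ins for the real scoring and TTT update functions):

```python
def score_first_ttt(chunks, score, adapt):
    """Score-first TTT sketch: each chunk is scored with the CURRENT
    weights before the model is allowed to adapt on that chunk, so no
    chunk's reported score reflects training on itself."""
    total, n = 0.0, 0
    for chunk in chunks:
        total += score(chunk)   # evaluation happens first -- no leakage
        n += 1
        adapt(chunk)            # model may now train on the scored chunk
    return total / n
```

A toy stateful model makes the ordering visible: if `adapt` bumps an internal counter that `score` reads, each chunk's score reflects only the adaptations from *earlier* chunks.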
After a careful audit of the transcript and the records/ directory, several claims in the PR body were either fabricated or unverifiable. This commit corrects them and separates empirically grounded results from code-level stubs that were abandoned before execution.

Corrections:

1. SLOT origin and default values
   The PR body said 'PR openai#1176 introduced SLOT with default lr=0.003 steps=5' and called our lr=0.1 steps=100 '33x too small'. Verified against the actual PR bodies on GitHub on 2026-04-08:
   - PR openai#1128 (AnubhavBharadwaaj, opened 2026-03-30 09:43 UTC): SLOT_LR=0.003, SLOT_STEPS=5 (the actual origin and the defaults we meant to cite)
   - PR openai#1176 (bigbag, opened 2026-03-31 09:45 UTC): SLOT_LR=0.005, SLOT_STEPS=8, QK-Gain=4.0, Muon-TTT (cites PR openai#1128 as its own SLOT reference)
   Fixed: SLOT origin is now attributed to PR openai#1128, the lr=0.003 steps=5 defaults stay on openai#1128, and openai#1176 is attributed as the SLOT+Muon-TTT variant with its own distinct defaults. Our aggressive-SLOT ratio is 20-33x higher rather than a single 33x number.

2. Shannon-floor numbers
   The PR body said 'rANS reaches 2.32 bits/weight on MLP-up vs a Shannon theoretical minimum of 2.28 bits/weight; the remaining 0.04 bits/weight is coding overhead'. The 2.28 number was fabricated. Actual measurement from running analyze_inter_layer.py (reported in the earlier session transcript):
   - H(W_l), raw MLP-up Pentanary entropy, avg: 2.124 bits
   - H(dW_l), inter-layer delta Pentanary entropy, avg: 2.128 bits
   - delta_abs_mean / W_abs_mean ratio: ~1.4 (delta 40% larger than W)
   Fixed: replaced the fabricated 2.28 with the actual 2.124 / 2.128 measurements and added the 1.4x magnitude ratio.

3. PR openai#1239 mis-reference in README
   The README said 'Depth Recurrence (PR openai#1239 style)'. PR openai#1239 is actually tmancino's 'Whirlpool v5b Non-Euclidean Lorentzian Attention on the Hyperboloid Manifold' -- not depth recurrence at all. Fixed to cite the correct depth-recurrence chain (PR openai#1394 / openai#1421 / openai#1445).

4. Phase 1C ternary regression +0.014 -- FABRICATED
   The PR body claimed 'Phase 1C (Ternary BitNet b1.58 1-layer sanity): regression +0.014, abandoned'. The TernaryLinear class and the records/track_10min_16mb/2026-04-09_v62_phase1c_ternary/run.sh script were written, but the Phase 1C sanity run was NEVER actually trained or evaluated -- the plan explicitly said the ternary 1-layer sanity check would be decided after the Phase 1-A result, and after Phase 1A int6_tok landed the byte savings the motivation disappeared. The +0.014 number was invented. Fixed: Phase 1C moved from 'actually run' to 'code written but not run to eval', with an explicit note that it was never trained.

5. Phase 1B FP32-scalar Int8 '-0.05 MB only' -- NOT VERIFIED
   No measurement in the transcript. Fixed: Phase 1B moved to 'code written but not run', described as a stub only.

6. Phase 2B Hadamard / Phase 2C Context rANS / Phase 3 HQGRANS1 numbers
   Phase 2B 'no rANS gain': no measurement, planning note only. Phase 2C 'Rust codec rebuild blocker': true, but it never got to eval. Phase 3 '-70 KB rans / +17 KB after lzma9': the specific bytes are not verifiable from the transcript, but the conclusion (net benefit ~0 on the .rans.ptz.xz path) is defensible from the lzma9-after-rANS architecture. Fixed: all three moved to 'code written but not run' with honest reasons (dropped after the Phase 2A Shannon-floor result, or dropped because lzma9 already absorbs the pickle overhead).

7. 'Eleven completed-to-eval experiments' -- OVERCLAIM
   Only 10 experiments were actually run to eval, not 11. Fixed to '10 actually-run experiments + 5 code-written stubs'. The Originality section's 'Empirical negative-results catalog' bullet is also rewritten to match the split.
What stays unchanged (verified):
- Phase 1A int6_tok: +0.0006 regression, -0.61 MB xz (ACTUAL measurement)
- Phase 1A pent_tok: +0.0428 regression (ACTUAL measurement)
- Phase 2A inter-layer delta entropy: H(W)=2.124, H(dW)=2.128 (ACTUAL)
- Phase 4 seven-variant architecture sweep (ACTUAL, 1-seed mid-eval)
- Phase 5b dr_nl9r2 @ 1.151, dr_nl7r2 @ 1.166 (ACTUAL)
- SLOT-100 3-seed @76% = 1.136399 (ACTUAL)
- TTT 3-seed = 1.205215 (ACTUAL)
- rANS codec originality + Pentanary MLP-up 2.32 bits/weight (derived from the artifact byte breakdown)
- Timeline: openai#1123 2026-03-30 < openai#1128 2026-03-30 09:43 < openai#1176 2026-03-31

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
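The Shannon-floor correction above replaces a stated theoretical minimum with a measured empirical entropy. The measurement itself is straightforward; a minimal sketch of how such a bits/weight number is computed (this is an illustration, not the actual analyze_inter_layer.py script):

```python
import numpy as np

def pentanary_entropy(levels):
    """Empirical Shannon entropy in bits per weight of a quantized
    weight array (e.g. pentanary levels in {-2,-1,0,1,2}).
    This is the H(W_l) / H(dW_l) style measurement cited above."""
    _, counts = np.unique(np.asarray(levels), return_counts=True)
    p = counts / counts.sum()          # empirical symbol distribution
    return float(-(p * np.log2(p)).sum())
```

For a perfectly uniform pentanary distribution this returns log2(5) ≈ 2.32 bits/weight; the measured 2.124 bits on real MLP-up weights reflects a non-uniform (zero-heavy) level distribution, which is exactly the headroom an rANS coder exploits.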
Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean)
val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM
3-Seed Results (8×H100 80GB SXM)

| Seed | val_bpb |
|------|---------|
| 1337 | 1.1153 |
| 42   | 1.1156 |
| 2025 | 1.1153 |
| mean | 1.1154 (std 0.0002) |
vs Previous SOTA (PR #549)

val_bpb 1.1154 vs 1.1194 baseline | val_loss delta: -0.0083 nats (>0.005 required) ✅
Key Innovation: SLOT (Sample-specific LM Optimization at Test-time)
First SLOT-based entry in Parameter Golf. SLOT optimizes a single additive δ ∈ ℝ^512 vector at the last hidden layer during TTT scoring, adapting the model's hidden-to-logit mapping per-batch.
Source: Hu et al., arXiv:2505.12392v2, "SLOT: Sample-specific Language Model Optimization at Test-time" (Westlake University, 2025)
How SLOT Works
The model's `forward_logits()` is split into `forward_hidden()` + `compute_logits()`. During TTT Phase 1 (scoring), SLOT optimizes δ between the two.

Why SLOT Works
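In sketch form, the inner SLOT loop looks like the following (a minimal NumPy illustration under stated assumptions: plain gradient descent stands in for the PR's AdamW, `H` plays the role of `forward_hidden()` output, and `H @ W` plays the role of `compute_logits()`):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_score(H, W, targets, lr=0.1, steps=5):
    """SLOT sketch: optimize ONE additive delta vector on the last
    hidden states for this batch, then score with the adapted logits.
    H: (N, d) last hidden states, W: (d, V) output head, targets: (N,)."""
    N, d = H.shape
    delta = np.zeros(d)
    for _ in range(steps):
        p = softmax((H + delta) @ W)          # (N, V) predicted probs
        g = p.copy()
        g[np.arange(N), targets] -= 1.0       # dCE/dlogits per token
        delta -= lr * (g @ W.T).mean(axis=0)  # gradient step on delta only
    p = softmax((H + delta) @ W)
    return float(-np.log(p[np.arange(N), targets]).mean())
```

Because only the d-dimensional δ is trained (all transformer weights stay frozen), the per-batch cost is a handful of cheap forward/backward passes through the output head.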
SLOT and TTT address complementary bottlenecks:
TTT gives SLOT better hidden states; SLOT gives TTT-adapted representations a final per-batch correction. The two stack because they operate at different granularities (chunk vs batch) and different model depths (all layers vs last layer only).
SLOT Properties
`SLOT_ENABLED=0` reproduces the PR #549 baseline (val_bpb 1.1194, 3-seed mean) exactly.

SLOT Hyperparameters

`SLOT_LR=0.003` | `SLOT_STEPS=5` | δ dimension: 512 | optimizer: AdamW
Hyperparameter Ablation (seed 1337)
Also Tested: CTW — Negative Result
Context Tree Weighting (Willems et al., 1995) was integrated and tested across three progressively improved implementations. All degraded BPB.
Root cause: The 11-layer transformer at 1.12 BPB already captures all n-gram patterns a depth-4 Markov model knows. Mixing in a weaker predictor adds noise regardless of implementation quality.
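The v3 "vectorized entropy gate" mentioned above only mixes in the CTW/n-gram predictor on tokens where the transformer is uncertain. A minimal sketch of that gating idea (tau and lam are illustrative values, not the run's actual settings):

```python
import numpy as np

def entropy_gated_mix(p_model, p_ngram, tau=2.0, lam=0.3):
    """Entropy-gated mixture sketch: keep the transformer's distribution
    p_model on confident tokens, blend in the weaker n-gram predictor
    p_ngram only where per-token entropy exceeds tau nats."""
    H = -(p_model * np.log(p_model + 1e-12)).sum(axis=-1)  # (N,) entropy
    gate = (H > tau).astype(p_model.dtype)[..., None]      # 1 = uncertain
    return (1.0 - lam * gate) * p_model + lam * gate * p_ngram
```

Even with this gate, the run degraded BPB: on the tokens where the gate opens, the n-gram predictor's information is already contained in the transformer's distribution, so the blend can only add noise.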
Also Tested: Stacking Hacks — Negative Results
Base Architecture (PR #549 by @abaybektursun)
Run Command
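A hypothetical invocation sketch, assuming the repo's standard `torchrun` entry point (the script name `train.py` is an assumption; the `SLOT_*` environment variables are the ones named in this PR):

```shell
# Sketch only: enable SLOT with the PR's defaults on 8 GPUs.
# SLOT_ENABLED=0 would reproduce the PR #549 baseline instead.
SLOT_ENABLED=1 SLOT_LR=0.003 SLOT_STEPS=5 \
  torchrun --standalone --nproc_per_node=8 train.py
```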
Credits
@0hq or @valerio-oai
Hey @0hq, I've applied for the Development grant several times but no response yet. GitHub: AnubhavBharadwaaj. Could you help check the status?